This session covers:
The following instructions will take you through the process of installing Python 3 on your local macOS machine via command line. Before you begin, ensure that you are connected to the internet and to a power source. This process may take around 15-30 minutes.
How To Install Jupyter Notebook on Mac and Windows
This analysis will require the following libraries:
If you receive the error "No module named X" when running the Import Libraries code below, open the terminal to install the missing packing using the command:
pip3 install X
Libraries contain additional functions that are not part of the core Python code, but can be used to support our analysis. Before we can use the functions within a library we must first install, then import the library into our environment.
It is common to alias a library when importing it, to simplify our code.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Pandas allows the user to customize its default options and settings.
Show All Columns
When dealing with a dataset that contains many columns, Jupyter will automatically collapse a number of columns in the middle of the DataFrame. We can use the following command to update the environment's configuration to display all DataFrame columns and force horizontal scrolling.
pd.set_option("display.max.columns", None)
Show # of Decimal Places
We can also update the number of decimal places we would like to display for numeric columns. This can be useful if we have a large number of columns.
pd.set_option("display.precision", 3)
Next, we will load the data we would like to analyze. We will be working with the popular Iris dataset from the UC Irvine Machine Learning Repository, which contains 3 iris species as well as a few properties about each flower. This file can be found in the data folder.
Load Data
We can use the Pandas read_csv method to load a CSV file from a particular file path into a DataFrame object.
Similarly, the read_excel method can be used to read a single sheet or a list of sheets from an Excel file.
View Data
We can use Pandas head to "peek" at the first 5 rows of the DataFrame. It is good practice to use the head( ) method on our imported data to ensure that 1) it is the correct dataset, and 2) it has been imported correctly.
Similarly, we can use the tail method to view the last 5 rows of the DataFrame.
We can also specify a certain number of rows to display, ex. df.head(2)
Note: It is also possible to connect to and read data directly from a database.
df = pd.read_csv("data/iris_data.csv")
df.head()
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
We can also peek at a single column.
# Notation 1
df["sepal_width"].head()
# Notation 2
df.sepal_width.head()
0 3.5 1 3.0 2 3.2 3 3.1 4 3.6 Name: sepal_width, dtype: float64
When working with a new (or even familiar!) dataset, the first step should be data exploration. Data exploration helps us to better understand our data, summarize our dataset, and identify issues like missing or unnatural values.
Python offers a number of methods to help us explore our dataset.
Find DataFrame Dimensions
Accessing the shape property returns a tuple that contains the number of rows and columns present in our DataFrame.
Note: A tuple is an ordered set of values (ex. An n-tuple contains n values)
df.shape
(150, 5)
Summary Statistics
The describe method generates descriptive statistics about the numerical columns in our dataset.
df.describe()
| sepal_length | sepal_width | petal_length | petal_width | |
|---|---|---|---|---|
| count | 150.000 | 150.000 | 150.000 | 150.000 |
| mean | 5.843 | 3.057 | 3.758 | 1.199 |
| std | 0.828 | 0.436 | 1.765 | 0.762 |
| min | 4.300 | 2.000 | 1.000 | 0.100 |
| 25% | 5.100 | 2.800 | 1.600 | 0.300 |
| 50% | 5.800 | 3.000 | 4.350 | 1.300 |
| 75% | 6.400 | 3.300 | 5.100 | 1.800 |
| max | 7.900 | 4.400 | 6.900 | 2.500 |
We can also use Pandas and NumPy methods to generate individual statistics.
Examples include: mean( ), median( ), min( ), max( ), sum( ), std( )
# Get the average values for each numerical column
df.mean()
sepal_length 5.843 sepal_width 3.057 petal_length 3.758 petal_width 1.199 dtype: float64
# Get the minimum sepal length in the dataset
np.min(df.sepal_length)
4.3
# Get the total sum of the petal width column
np.sum(df.petal_width)
179.90000000000003
Info
The info method provides a summary of all columns, data types, and the number of Null (empty) values. This is useful if your data also contains non-numeric columns.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 150 entries, 0 to 149 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 sepal_length 150 non-null float64 1 sepal_width 150 non-null float64 2 petal_length 150 non-null float64 3 petal_width 150 non-null float64 4 variety 150 non-null object dtypes: float64(4), object(1) memory usage: 6.0+ KB
Column List
If we have a dataset containing many columns, it can be helpful to view the entire list of columns.
df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'variety'],
dtype='object')
Find Duplicate Entries
Use the duplicated method to check if an exact entry appears in our dataset multiple times. Based on the goals of our analysis, we may want to either keep or remove duplicate rows from our data.
df[df.duplicated()]
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | Virginica |
Unique Values
The unique method returns the unique values of a series object. A series is a 1D array with axis labels. A single DataFrame column is a series.
If our data has categorical columns, it can be useful to retrieve a list of the unique values of a particular category.
df.variety.unique()
array(['Setosa', 'Versicolor', 'Virginica'], dtype=object)
Value Counts
We can use the value_counts method to determine the number of times each unique value appears in a Series.
df.variety.value_counts()
Virginica 50 Setosa 50 Versicolor 50 Name: variety, dtype: int64
Count Missing Values
Count the number of missing values in each column. In this case, we are not missing any data.
df.isnull().sum()
sepal_length 0 sepal_width 0 petal_length 0 petal_width 0 variety 0 dtype: int64
Likewise, we can count the number of non-empty values in each column!
df.notnull().sum()
sepal_length 150 sepal_width 150 petal_length 150 petal_width 150 variety 150 dtype: int64
Pandas Profiling is an extremely useful package that generates an interative, all-in-one report on our dataset.
Note: This requires the installation of this additional library:
pip3 install pandas-profiling
import pandas_profiling as pp
pp.ProfileReport(df)
<ipython-input-18-ef47fb68451c>:1: DeprecationWarning: `import pandas_profiling` is going to be deprecated by April 1st. Please use `import ydata_profiling` instead. import pandas_profiling as pp
Matplotlib and Seaborn are two of the most popular libraries for visualization in Python. While Matplotlib is useful for plotting basic charts, Seaborn can create "prettier" charts with less syntax. Plotly can be used to make interactive plots.
The following resources provide a detailed overview of common plotting techniques and how to apply them to the Iris dataset.
Scatter Plots
# Use Matplotlib to create a basic scatter plot
%matplotlib inline
plt.scatter(df["sepal_length"], df["sepal_width"])
# Add labels and title
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Scatter Plot")
# Display the plot
plt.show()
# Scatter Plot, grouping category by color
fig, ax = plt.subplots()
# Define color map
colors = {'Setosa':'red', 'Versicolor':'green', 'Virginica':'blue'}
grouped = df.groupby('variety')
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='sepal_length', y='sepal_width', label=key, color=colors[key])
# Add labels and title
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.title("Scatter Plot")
plt.show()
Box Plots
Using Seaborn.
sns.catplot(data=df, x="variety", y="petal_length", kind="box")
<seaborn.axisgrid.FacetGrid at 0x7ffcc5277ca0>
Histograms
df.petal_width.hist(bins = 20)
<AxesSubplot:>
plt.hist(df['petal_width'], bins = 20)
(array([34., 7., 7., 1., 1., 0., 0., 7., 3., 5., 21., 12., 4.,
2., 12., 11., 6., 3., 8., 6.]),
array([0.1 , 0.22, 0.34, 0.46, 0.58, 0.7 , 0.82, 0.94, 1.06, 1.18, 1.3 ,
1.42, 1.54, 1.66, 1.78, 1.9 , 2.02, 2.14, 2.26, 2.38, 2.5 ]),
<BarContainer object of 20 artists>)
Use the Plotly package to create interactive plots. Try hovering over a point on the plots below.
import plotly.express as px
# Scatter plot with variety column as color and sepal_length column as point size
fig1 = px.scatter(df, x ="petal_width", y="petal_length", color ="variety",
size='sepal_length', hover_data=['sepal_width'])
fig1.show()
# Scatter plot with color range
fig2 = px.scatter(df, x="sepal_width", y="sepal_length", color='petal_length')
fig2.show()
# Scatter plot with variety defining marker color and shape
fig3 = px.scatter(df, x="sepal_width", y="sepal_length", color="variety", symbol="variety")
fig3.show()
df2 = df.copy()
df2["petal_area"] = df2.petal_length * df2.petal_width
df2["petal_area_gtr_sepal_len"] = np.where(df2["petal_area"] > df2["sepal_length"], True, False)
# Scatter Plot with facetted columns
# Market color displays if petal area is greater than sepal length
fig4 = px.scatter(df2, x="petal_length", y="petal_width", color="petal_area_gtr_sepal_len", facet_col="variety")
fig4.show()
Filtering allows us to create a subset of our data that meets a certain condition or list of conditions. We can create filters using Boolean expressions, which assign True/False values based on whether the conditions are met.
This is similar to conditional WHERE or HAVING statements in SQL.
df[df["petal_length"] > df["sepal_width"]]
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | Versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | Versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | Versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | Versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | Versicolor |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |
100 rows × 5 columns
Equality Statements
We can use a boolean equality statement to identify the DataFrame rows where the iris variety is Versicolor. This expression returns a vector of True/False values that correspond to each row in our DataFrame.
# Notation 1
df["variety"] == "Versicolor"
# Notation 2
df.variety == "Versicolor"
0 False
1 False
2 False
3 False
4 False
...
145 False
146 False
147 False
148 False
149 False
Name: variety, Length: 150, dtype: bool
We can use this expression to filter our DataFrame and retrieve the subset of rows where the boolean expression is True. The following command creates a new DataFrame that contains the subset of rows that met the filter condition.
versicolor_df = df[df["variety"] == "Versicolor"]
versicolor_df.head()
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | Versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | Versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | Versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | Versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | Versicolor |
Not Equal Statements
Filter where DataFrame rows are not equal to some value.
not_df = df[(df["variety"] != "Versicolor")]
not_df.head(2)
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
In List
Filter DataFrame rows whose value is present in a list.
variety_list = ["Setosa","Versicolor"]
df[df["variety"].isin(variety_list)]
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| ... | ... | ... | ... | ... | ... |
| 95 | 5.7 | 3.0 | 4.2 | 1.2 | Versicolor |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | Versicolor |
| 97 | 6.2 | 2.9 | 4.3 | 1.3 | Versicolor |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | Versicolor |
| 99 | 5.7 | 2.8 | 4.1 | 1.3 | Versicolor |
100 rows × 5 columns
Null Values
Filter DataFrame rows where a value is Null.
Note: Only remove Null values if it will not negatively affect the results of your analysis!
Things to Consider When Dropping Data:
df[df["sepal_length"].isnull()]
| sepal_length | sepal_width | petal_length | petal_width | variety |
|---|
Filter DataFrame rows where a value is not null. This is useful if we would like to remove rows with empty or missing values.
df[df["sepal_length"].notnull()]
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |
150 rows × 5 columns
We can combine boolean expressions to create more complex filters. Each condition MUST be contained within its own set of parenthesis.
AND Statements
Filter where DataFrame rows meet both conditions.
and_df = df[(df["variety"] == "Versicolor") & (df["petal_length"] > 4)]
and_df.head(3)
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | Versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | Versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | Versicolor |
OR Statements
Filter where DataFrame rows meet one of the conditions.
or_df = df[(df["variety"] == "Versicolor") | (df["variety"] == "Setosa")]
or_df.head(2)
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
Long Expressions
If we have a long expression, we can define a new variable to hold the filter condition and then pass it to our DataFrame.
filter_condition = ((df["variety"] == "Versicolor") | (df["variety"] == "Setosa")) & (df["sepal_length"] >= 6.7)
df[filter_condition]
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | Versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | Versicolor |
| 65 | 6.7 | 3.1 | 4.4 | 1.4 | Versicolor |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 | Versicolor |
| 77 | 6.7 | 3.0 | 5.0 | 1.7 | Versicolor |
| 86 | 6.7 | 3.1 | 4.7 | 1.5 | Versicolor |
The sort_values is used to sort a DataFrame in either ascending or descending order based on the values of a column or columns.
df.sort_values(by=['sepal_length'], ascending = False)
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica |
| ... | ... | ... | ... | ... | ... |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | Setosa |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | Setosa |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | Setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Setosa |
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | Setosa |
150 rows × 5 columns
Sort by multiple columns.
df.sort_values(by=['sepal_length', 'variety'], ascending = True)
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | Setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Setosa |
| 38 | 4.4 | 3.0 | 1.3 | 0.2 | Setosa |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | Setosa |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | Setosa |
| ... | ... | ... | ... | ... | ... |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica |
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica |
150 rows × 5 columns
# Get the minimum sepal width for each iris variety
df.groupby("variety")["sepal_width"].min()
variety Setosa 2.3 Versicolor 2.0 Virginica 2.2 Name: sepal_width, dtype: float64
# Get the mean and median of each property by variety
df.groupby("variety").agg(["mean", "median"])
| sepal_length | sepal_width | petal_length | petal_width | |||||
|---|---|---|---|---|---|---|---|---|
| mean | median | mean | median | mean | median | mean | median | |
| variety | ||||||||
| Setosa | 5.006 | 5.0 | 3.428 | 3.4 | 1.462 | 1.50 | 0.246 | 0.2 |
| Versicolor | 5.936 | 5.9 | 2.770 | 2.8 | 4.260 | 4.35 | 1.326 | 1.3 |
| Virginica | 6.588 | 6.5 | 2.974 | 3.0 | 5.552 | 5.55 | 2.026 | 2.0 |
# Get multiple summary statistics on one column, by variety
df.groupby("variety").agg({"petal_length": [sum, min, max]})
| petal_length | |||
|---|---|---|---|
| sum | min | max | |
| variety | |||
| Setosa | 73.1 | 1.0 | 1.9 |
| Versicolor | 213.0 | 3.0 | 5.1 |
| Virginica | 277.6 | 4.5 | 6.9 |
# Create custom names for the summary variables
df.groupby("variety")["petal_length"].agg(petal_length_mean = 'mean', petal_length_sum = 'sum')
| petal_length_mean | petal_length_sum | |
|---|---|---|
| variety | ||
| Setosa | 1.462 | 73.1 |
| Versicolor | 4.260 | 213.0 |
| Virginica | 5.552 | 277.6 |
# Get summary statistics for specific columns.
df.groupby("variety")[["sepal_length","sepal_width"]].agg(['sum','count'])
| sepal_length | sepal_width | |||
|---|---|---|---|---|
| sum | count | sum | count | |
| variety | ||||
| Setosa | 250.3 | 50 | 171.4 | 50 |
| Versicolor | 296.8 | 50 | 138.5 | 50 |
| Virginica | 329.4 | 50 | 148.7 | 50 |
Use pd.NamedAgg for column specific aggregation.
df.groupby('variety').agg(
max_sepal_length = pd.NamedAgg(column='sepal_length', aggfunc=max),
min_sepal_length = pd.NamedAgg(column='sepal_length', aggfunc=min),
)
| max_sepal_length | min_sepal_length | |
|---|---|---|
| variety | ||
| Setosa | 5.8 | 4.3 |
| Versicolor | 7.0 | 4.9 |
| Virginica | 7.9 | 4.9 |
Adding Columns
We can create new columns by performing an action on the column.
df["petal_length_max"] = df.petal_length.max()
df.head()
| sepal_length | sepal_width | petal_length | petal_width | variety | petal_length_max | |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa | 6.9 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa | 6.9 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa | 6.9 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa | 6.9 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa | 6.9 |
Or, we can create new columns as combinations of the existing columns.
df["petal_area"] = df.petal_length * df.petal_width
df.head(2)
| sepal_length | sepal_width | petal_length | petal_width | variety | petal_length_max | petal_area | |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa | 6.9 | 0.28 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa | 6.9 | 0.28 |
Assign columns a based on whether they meet some condition.
df["petal_len_greater_sepal_wd"] = np.where(df["petal_length"] > df["sepal_width"], True, False)
df.head(5)
| sepal_length | sepal_width | petal_length | petal_width | variety | petal_length_max | petal_area | petal_len_greater_sepal_wd | |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa | 6.9 | 0.28 | False |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa | 6.9 | 0.28 | False |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa | 6.9 | 0.26 | False |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa | 6.9 | 0.30 | False |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa | 6.9 | 0.28 | False |
Dropping Columns
We can also drop unnecessary columns from our DataFrame.
df = df.drop(["petal_area", "petal_length_max","petal_len_greater_sepal_wd"], axis = 1)
df.head(2)
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
Dropping Duplicate Rows
The drop_duplicates method can be used return a DataFrame with duplicate rows removed.
df.drop_duplicates()
| sepal_length | sepal_width | petal_length | petal_width | variety | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |
149 rows × 5 columns
Retrieve Column Subset
We can retrieve a subset of columns from our DataFrame.
subset_df = df[["sepal_length", "sepal_width", "variety"]]
subset_df.head(2)
| sepal_length | sepal_width | variety | |
|---|---|---|---|
| 0 | 5.1 | 3.5 | Setosa |
| 1 | 4.9 | 3.0 | Setosa |
Retrieve Rows
The .loc attribute retrieves the row at a provided DataFrame index.
df.loc[2]
sepal_length 4.7 sepal_width 3.2 petal_length 1.3 petal_width 0.2 variety Setosa Name: 2, dtype: object
# Create a new DataFrame
iris_species = pd.DataFrame([{'variety' : 'Setosa', 'species_num' : 1},
{'variety':'Versicolor','species_num': 2},
{'variety':'Sirginica','species_num': 3}])
iris_species
| variety | species_num | |
|---|---|---|
| 0 | Setosa | 1 |
| 1 | Versicolor | 2 |
| 2 | Sirginica | 3 |
pd.merge(df, iris_species, on = "variety")
| sepal_length | sepal_width | petal_length | petal_width | variety | species_num | |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 5.7 | 3.0 | 4.2 | 1.2 | Versicolor | 2 |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | Versicolor | 2 |
| 97 | 6.2 | 2.9 | 4.3 | 1.3 | Versicolor | 2 |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | Versicolor | 2 |
| 99 | 5.7 | 2.8 | 4.1 | 1.3 | Versicolor | 2 |
100 rows × 6 columns
Goal 1: For each iris variety, how many entries have a sepal length of greater than 5.5?
df[df.sepal_length > 5.5].groupby("variety")["variety"].count()
variety Setosa 3 Versicolor 39 Virginica 49 Name: variety, dtype: int64
Goal 2: What is the total sum of Versicolor petal areas?
df["petal_area"] = df.petal_length * df.petal_width
df["petal_area"][df["variety"] == "Versicolor"].sum()
286.02
Goal 3: What is the minimum sepal width for each iris variety?
df.groupby("variety")["sepal_width"].min()
variety Setosa 2.3 Versicolor 2.0 Virginica 2.2 Name: sepal_width, dtype: float64
Goal 4: On average, which iris variety has the smallest petal area?
df.groupby("variety")["petal_area"].mean().reset_index().sort_values(["petal_area"], ascending = True)
| variety | petal_area | |
|---|---|---|
| 0 | Setosa | 0.366 |
| 1 | Versicolor | 5.720 |
| 2 | Virginica | 11.296 |
Goal 5: Normalize a column to a common scale.
df["petal_length_max"] = df.petal_length.max()
df["petal_length_min"] = df.petal_length.min()
df["petal_length_norm"] = (df["petal_length"] - df["petal_length_min"]) / (df["petal_length_max"] - df["petal_length_min"])
df
| sepal_length | sepal_width | petal_length | petal_width | variety | petal_area | petal_length_max | petal_length_min | petal_length_norm | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa | 0.28 | 6.9 | 1.0 | 0.068 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa | 0.28 | 6.9 | 1.0 | 0.068 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa | 0.26 | 6.9 | 1.0 | 0.051 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa | 0.30 | 6.9 | 1.0 | 0.085 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa | 0.28 | 6.9 | 1.0 | 0.068 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica | 11.96 | 6.9 | 1.0 | 0.712 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica | 9.50 | 6.9 | 1.0 | 0.678 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica | 10.40 | 6.9 | 1.0 | 0.712 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica | 12.42 | 6.9 | 1.0 | 0.746 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica | 9.18 | 6.9 | 1.0 | 0.695 |
150 rows × 9 columns